05_model

library("tidyverse")
library("readr")
library("broom")
source("99_proj_func.R") # custom functions and theme

Load data

We load the data that was augmented for modelling:

load("../data/04_data_for_modelling.Rdata")

Modelling tumor size as a function of variables

Firstly a preliminary analysis of linear model of variables against tumor size as an indicator of disease progression is done.

df_long_nested_model <- df_long_nested |>
  mutate(model_object = map(data,
                            ~lm(size_of_primary_tumor ~ value,
                                data = .x))) |> 
  mutate(tidy_model = map(model_object, ~tidy(.x,
                                              conf.int = TRUE,
                                              conf.level = 0.95))) |> 
  unnest(cols = tidy_model) |> 
  filter(term == "value") |>  #discard intercept terms
  mutate(p_adjusted = p.adjust(p.value)) |> # correct for multiple testing
  arrange(p_adjusted) #sort pvalues to reveal highest significance

df_long_nested_model |> 
  select(-data, -model_object, -term, -std.error, -statistic) |>  
  print()
# A tibble: 16 × 6
# Groups:   variable [16]
   variable                      estimate  p.value conf.low conf.high p_adjusted
   <chr>                            <dbl>    <dbl>    <dbl>     <dbl>      <dbl>
 1 combined_index                 2.32    7.08e-18   1.81      2.83     7.08e-18
 2 stage                          6.71    1.67e- 9   4.57      8.86     1.67e- 9
 3 bone_metastases                8.97    2.90e- 9   6.06     11.9      2.90e- 9
 4 serum_prostatic_acid_phospha…  0.116   2.63e- 5   0.0623    0.170    2.63e- 5
 5 follow_up_months              -0.0955  5.96e- 5  -0.142    -0.0492   5.96e- 5
 6 is_alive                       4.43    2.95e- 4   2.04      6.81     2.95e- 4
 7 medical_history               -2.66    1.84e- 2  -4.87     -0.450    1.84e- 2
 8 serum_hemoglobin              -0.663   2.03e- 2  -1.22     -0.103    2.03e- 2
 9 performance                    2.68    4.24e- 2   0.0922    5.28     4.24e- 2
10 dia_blood_pressure            -0.467   2.24e- 1  -1.22      0.286    2.24e- 1
11 weight                        -0.0472  2.63e- 1  -0.130     0.0355   2.63e- 1
12 sys_blood_pressure             0.222   3.37e- 1  -0.231     0.675    3.37e- 1
13 ekg                            0.158   4.20e- 1  -0.226     0.542    4.20e- 1
14 dosage                        -0.137   6.21e- 1  -0.679     0.406    6.21e- 1
15 treatment                     -0.161   9.00e- 1  -2.68      2.35     9.00e- 1
16 age                            0.00851 9.15e- 1  -0.148     0.165    9.15e- 1

Based on these models it appears that most variables have an independent correlation with tumor size. This will be illustrated below.

Volcano plot

A volcano plot shows our estimate slope vs the p-value.

(df_long_nested_model |> 
  ggplot(mapping = aes(x = estimate,
                     y = -log10(p_adjusted),
                     color = factor(p_adjusted < 0.05))) +
  geom_point(alpha = 0.9) +
  geom_hline(yintercept = 0)+
  geom_vline(xintercept = 0)+
  best_theme() +
  theme(legend.position = "none") +
  ggrepel::geom_text_repel(aes(label = case_when(p_adjusted < 0.05 ~ variable,
                                                 TRUE ~ ""))) +
  labs(
    title = "Estimate vs -log10 of adjusted p-value",
    y = "-log10 of p-values adjusted for multiple testing",
    x = "estimate")) |>
    save_and_show("linear_volcano")
Saving 7 x 5 in image

Forest plot

The forest plot shows the confidence interval of the modeled slope for all variables

(forest_plot <- df_long_nested_model |> 
  filter(p_adjusted < 0.05) |> 
  ggplot(mapping = aes(x = estimate,
                       y = fct_reorder(variable, estimate),#sort by the slope 
                       xmin = conf.low,
                       xmax = conf.high)) +
  geom_errorbar(orientation = "y") +
  geom_point() +
  geom_vline(xintercept = 0) +
  theme_minimal(base_size = 12) +
  labs(
    x = "Estimates (95% CIs)",
    y = "Variable",
    title = "Variables with significant impact on tumor size"
  ) +
  best_theme()) |>
  save_and_show("linear_forest_plot")
Saving 7 x 5 in image

Residual QQ-plots

Here we seek to validate if the linear models make sense of if there is a depature from expected normality.

df_resids <- df_long_nested_model |> 
  mutate(aug = map(model_object, broom::augment)) |> 
  unnest(aug)
(ggplot(df_resids, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal()) |>
  save_and_show("linear_qq")
Saving 7 x 5 in image

Evident from the QQ-plots are a distinct departure from the normal-distribution in tails. This is present in each model and therefore it is likely that tumor size has outliers. This is investigated following by a histogram and QQ-plot.

(df_num |> 
  ggplot() +
  geom_histogram(aes(x = size_of_primary_tumor), 
                 color = "black", alpha = 0.8, fill = "grey") +
  best_theme()) |> 
save_and_show("tumor_histogram")
Saving 7 x 5 in image
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_bin()`).
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_bin()`).

(df_num |>  
  ggplot(aes(sample = size_of_primary_tumor)) +
  stat_qq() +
  stat_qq_line(color = "salmon") +
  best_theme()) |>
save_and_show("tumor_qq")
Saving 7 x 5 in image
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_qq()`).
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_qq_line()`).
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_qq()`).
Warning: Removed 5 rows containing non-finite outside the scale range
(`stat_qq_line()`).

These two plots confirm a definite departure from normality, indicative that log-transformed data would make a better fit.

(df_num |>
  filter(size_of_primary_tumor > 0) |>
  ggplot(aes(sample = log(size_of_primary_tumor))) +
  stat_qq() +
  stat_qq_line(color = "salmon") +
  best_theme()) |> 
save_and_show("log_tumor_qq")
Saving 7 x 5 in image

Evident from the model is a much closer normal distribution even if an S-shape is present this is likely due to a few outliers, but overall it reveals a logarithmic scale may suit the data better.

Modelling based on log transformed data

As just confirmed it is to be investigated wether a log-transformation of our data improves correlations. This requires dropping patients with tumor size zero.

df_long_log <- df_long_nested |> 
  unnest(cols = c(data)) |>
  filter(size_of_primary_tumor > 0)

df_long_log |> slice_sample(n = 10)
# A tibble: 160 × 4
# Groups:   variable [16]
   variable patient_no size_of_primary_tumor value
   <chr>         <dbl>                 <dbl> <dbl>
 1 age             232                    62    72
 2 age              75                    47    65
 3 age              10                    21    78
 4 age             189                     4    73
 5 age             454                    17    88
 6 age             191                     7    72
 7 age             505                    32    82
 8 age             233                    25    74
 9 age             212                     3    76
10 age             214                    11    69
# ℹ 150 more rows

Nesting the model objects for cleanliness.

df_long_log_nest <- df_long_log |>
  group_by(variable) |>
  nest()

Model of logarithmic tumor size:

df_long_nested_model_log <- df_long_log_nest |>
  mutate(model_object = map(data,
                            ~lm(log(size_of_primary_tumor) ~ value,
                                data = .x))) |> 
  mutate(tidy_model = map(model_object, ~tidy(.x,
                                              conf.int = TRUE,
                                              conf.level = 0.95))) |> 
  unnest(cols = tidy_model) |> 
  filter(term == "value") |>  #discard intercept terms
  mutate(p_adjusted = p.adjust(p.value)) |> # correct for multiple testing
  arrange(p_adjusted) #sort pvalues to reveal highest significance

df_long_nested_model_log |> 
  select(-data, -model_object, -term, -std.error, -statistic) |>  
  print()
# A tibble: 16 × 6
# Groups:   variable [16]
   variable                      estimate  p.value conf.low conf.high p_adjusted
   <chr>                            <dbl>    <dbl>    <dbl>     <dbl>      <dbl>
 1 combined_index                 1.58e-1 6.12e-14  0.118     0.198     6.12e-14
 2 stage                          4.69e-1 4.28e- 8  0.303     0.634     4.28e- 8
 3 bone_metastases                6.16e-1 8.95e- 8  0.393     0.839     8.95e- 8
 4 serum_prostatic_acid_phospha…  8.46e-3 5.56e- 5  0.00437   0.0126    5.56e- 5
 5 is_alive                       2.95e-1 1.69e- 3  0.111     0.478     1.69e- 3
 6 follow_up_months              -5.63e-3 2.10e- 3 -0.00921  -0.00205   2.10e- 3
 7 serum_hemoglobin              -5.41e-2 1.34e- 2 -0.0969   -0.0113    1.34e- 2
 8 medical_history               -2.11e-1 1.44e- 2 -0.380    -0.0423    1.44e- 2
 9 sys_blood_pressure             1.93e-2 2.73e- 1 -0.0153    0.0540    2.73e- 1
10 dosage                        -2.27e-2 2.85e- 1 -0.0643    0.0190    2.85e- 1
11 dia_blood_pressure            -2.73e-2 3.50e- 1 -0.0848    0.0301    3.50e- 1
12 ekg                            1.15e-2 4.45e- 1 -0.0180    0.0409    4.45e- 1
13 performance                    7.31e-2 4.68e- 1 -0.125     0.271     4.68e- 1
14 treatment                     -3.26e-2 7.40e- 1 -0.225     0.160     7.40e- 1
15 age                           -1.40e-3 8.18e- 1 -0.0133    0.0105    8.18e- 1
16 weight                         1.01e-4 9.75e- 1 -0.00624   0.00644   9.75e- 1

If we compare these p-values with earlier linear model we see a much more significant correlation between these and log-transformed tumor size. This will be validated with residual QQ-plots.

Preparing for residual plots

df_resids_log <- df_long_nested_model_log |> 
  mutate(aug = map(model_object, 
                   broom::augment)) |> 
  unnest(aug)
(ggplot(df_resids_log, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  facet_wrap(~ variable, scales = "free") +
  theme_minimal()) |> 
save_and_show("log_qq")
Saving 7 x 5 in image

Aligning with our log-transformed tumor data the variables now also show a smaller from expected expectation normality.

Volcano plot

A volcano plot shows our estimate slope vs the p-value.

(
df_long_nested_model_log |> 
  ggplot(mapping = aes(x = estimate,
                     y = -log10(p_adjusted),
                     color = factor(p_adjusted < 0.05))) +
  geom_point(alpha = 0.9) +
  geom_hline(yintercept = 0)+
  geom_vline(xintercept = 0)+
  best_theme() +
  theme(legend.position = "none") +
  ggrepel::geom_text_repel(aes(label = case_when(p_adjusted < 0.05 ~ variable,
                                                 TRUE ~ ""))) +
  labs(
    title = "Estimate vs -log10 of adjusted p-value",
    y = "-log10 of p-values adjusted for multiple testing",
    x = "estimate")) |> 
save_and_show(plot_name = "log_volcano")
Saving 7 x 5 in image

Forest plot

The forest plot shows the confidence interval of the modelled slope for all variables

(df_long_nested_model_log |> 
  filter(p_adjusted < 0.05) |> 
  ggplot(mapping = aes(x = estimate,
                       y = fct_reorder(variable, estimate),#sort byr the skope
                       xmin = conf.low,
                       xmax = conf.high)) +
  geom_errorbar(orientation = "y") +
  geom_point() +
  geom_vline(xintercept = 0) +
  theme_minimal(base_size = 12) +
  labs(
    x = "Estimates (95% CIs)",
    y = "Variable",
    title = "Variables with significant impact on tumor size"
  ) +
  best_theme()) |> 
save_and_show("log_forest")
Saving 7 x 5 in image

Correlation among continuous clinical variables

Beyond comparison with tumor size a pairwise correlation for the quantitative biological and physiological measurements is computed to identify potential collinearity before survival analysis.

# Select relevant numeric variables
corr_df <- df_num |>
  select(
    age,
    weight,
    serum_hemoglobin,
    serum_prostatic_acid_phosphatase,
    size_of_primary_tumor)

# Compute raw correlation matrix
corr_mat <- cor(corr_df, use = "pairwise.complete.obs")

Correlation heatmap

# Convert matrix to long format for ggplot2
corr_long <- as.data.frame(corr_mat) |>
  rownames_to_column("Var1") |>
  pivot_longer(
    cols = -Var1,
    names_to = "Var2",
    values_to = "cor"
  )

corr_long |>
  ggplot(aes(x = Var2, y = Var1, fill = cor)) +
  geom_tile(color = "white") +
  geom_text(aes(label = sprintf("%.2f", cor)), size = 3) +
  scale_fill_gradient2(
    low = "blue",
    mid = "white",
    high = "red",
    midpoint = 0,
    limits = c(-1, 1),
    name = "Correlation"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1), # retate words
    panel.grid = element_blank() # remove grid lines
  ) +
  labs(
    title = "Correlation Matrix of Selected Variables",
    x = NULL,
    y = NULL
  )

This correlation matrix helps to identify redundant predictors. All the relevances is below 0.8, which indicates the selected variables can be seen as independent and used in a survival analysis. Given previous analysis it can also be concluded that shown variables

Exploratory multivariate analysis

Expanding on colliniarity and variables that have been shown significant in relation to log(size_of_primary_tumor) a few multivariate models are explored.

model_tidy <- lm(log(size_of_primary_tumor) ~ 
                   serum_hemoglobin + 
                   medical_history + 
                   bone_metastases + 
                   serum_prostatic_acid_phosphatase,
                 data = df_num |>
                   filter(size_of_primary_tumor > 0)) |> 
  tidy()
print(model_tidy)
# A tibble: 5 × 5
  term                             estimate std.error statistic  p.value
  <chr>                               <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)                       2.57      0.310       8.30  1.09e-15
2 serum_hemoglobin                 -0.0206    0.0221     -0.930 3.53e- 1
3 medical_history                  -0.187     0.0837     -2.24  2.57e- 2
4 bone_metastases                   0.463     0.126       3.68  2.62e- 4
5 serum_prostatic_acid_phosphatase  0.00473   0.00225     2.10  3.61e- 2

Logistic regression predicting survivability (bad results)

A binary logistic regression is perfomed on the is_alive column. This is of limited biological relevanc but may reveal an interesting insight. interesting cols: select(is_alive, age, weight, size_of_primary_tumor,serum_prostatic_acid_phosphatase, performance)

# Turn df is_alive column into numeric variable for modelling:
df_a_ts <- df_num |>
  select(is_alive, size_of_primary_tumor) |> 
  drop_na()


df_a_ts |>
  ggplot(aes(y = is_alive,
             x = size_of_primary_tumor)) +
  geom_point() +
  labs(
    x = "Size of primary tumor",
    y = "Alive at time of study (1: YES, 0: NO)",
    title = "Being alive vs tumor size"
  ) +
  best_theme()

There seems to be a threshold for tumor size vs being alive at the time of study.

Fitting the logistic regression

my_glm_mdl <- glm(formula = is_alive ~ size_of_primary_tumor,
                  family = binomial(link = "logit"),
                  data = df_num)
my_glm_mdl

Call:  glm(formula = is_alive ~ size_of_primary_tumor, family = binomial(link = "logit"), 
    data = df_num)

Coefficients:
          (Intercept)  size_of_primary_tumor  
               0.4333                 0.0339  

Degrees of Freedom: 490 Total (i.e. Null);  489 Residual
  (5 observations deleted due to missingness)
Null Deviance:      592.4 
Residual Deviance: 578.1    AIC: 582.1
df_a_ts |> 
  mutate(my_glm_model = pluck(my_glm_mdl, "fitted.values")) |> 
  ggplot(aes(x = size_of_primary_tumor, y = is_alive)) +
  geom_point(alpha = 0.5,
             size = 3) +
  geom_line(aes(y = my_glm_model)) +
  labs(
    x = "Size of primary tumor",
    y = "Alive at time of study (0: YES, 1: NO)",
    title = "Being alive vs tumor size"
  ) +
  best_theme()

As mentioned there seems to be a certain correlation between these two but it is not a very telling analysis so it should not be considered.

Multivariate on measurements – No significant conclusions

Mentioned preivously this is an exploration of blood and treatment data to evaluate whether it could have an explanatory effect. This is conducted due to QQ-plots showing some variation and S-tailing. First explored is the tumor size, and potential effects of treatment, AP and hemoglobin.

ap_ts_log_tidy_model <- lm(
  log(size_of_primary_tumor) ~ 
    log(serum_prostatic_acid_phosphatase) +
    serum_prostatic_acid_phosphatase +
    serum_hemoglobin + 
    dosage + 
    bone_metastases +
    stage,
  data = df_num |> 
    filter(size_of_primary_tumor > 0))

print(ap_ts_log_tidy_model |>
        tidy())
# A tibble: 7 × 5
  term                                  estimate std.error statistic     p.value
  <chr>                                    <dbl>     <dbl>     <dbl>       <dbl>
1 (Intercept)                            3.06      0.555       5.50      6.05e-8
2 log(serum_prostatic_acid_phosphatase)  0.242     0.0543      4.46      1.01e-5
3 serum_prostatic_acid_phosphatase      -0.00546   0.00298    -1.83      6.76e-2
4 serum_hemoglobin                      -0.0268    0.0217     -1.24      2.17e-1
5 dosage                                -0.0297    0.0200     -1.48      1.39e-1
6 bone_metastases                        0.261     0.136       1.92      5.51e-2
7 stage                                 -0.118     0.132      -0.892     3.73e-1

Disregarding non-significant values:

ap_ts_log_tidy_model |> 
  tidy() |>
  mutate(padj = p.adjust(p.value, method = "fdr")) |>
  filter(padj < 0.05) |>
  save_and_show_table("log_tumor_sig_fit")
# A tibble: 2 × 6
  term                              estimate std.error statistic p.value    padj
  <chr>                                <dbl>     <dbl>     <dbl>   <dbl>   <dbl>
1 (Intercept)                          3.06     0.555       5.50 6.05e-8 4.23e-7
2 log(serum_prostatic_acid_phospha…    0.242    0.0543      4.46 1.01e-5 3.55e-5

Seems that log(AP) does have a significant correlation with tumor size but not more than the log tumor size.

Exploratory analysis of effects on prostatic acid phosphatase

ap_es_log_tidy_model <- lm(
  serum_prostatic_acid_phosphatase ~ 
    dosage +
    I(dosage^2) +
    I(dosage^2)*serum_hemoglobin +
    bone_metastases*dosage +
    log(size_of_primary_tumor),
  data = df_num |> filter(size_of_primary_tumor > 0))

print(ap_es_log_tidy_model |> tidy())
# A tibble: 8 × 5
  term                         estimate std.error statistic       p.value
  <chr>                           <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)                    9.16      7.55      1.21   0.226        
2 dosage                         0.259     2.77      0.0933 0.926        
3 I(dosage^2)                    1.30      0.802     1.62   0.106        
4 serum_hemoglobin              -0.794     0.516    -1.54   0.124        
5 bone_metastases               19.6       3.18      6.16   0.00000000156
6 log(size_of_primary_tumor)     2.00      0.918     2.17   0.0302       
7 I(dosage^2):serum_hemoglobin  -0.0951    0.0430   -2.21   0.0274       
8 dosage:bone_metastases        -0.341     1.15     -0.298  0.766        

Again filtering for non-significant contributors.

ap_es_log_tidy_model |> 
  tidy() |>
  mutate(padj = p.adjust(p.value, method = "fdr")) |>
  filter(padj < 0.05) |>
  print()
# A tibble: 1 × 6
  term            estimate std.error statistic       p.value         padj
  <chr>              <dbl>     <dbl>     <dbl>         <dbl>        <dbl>
1 bone_metastases     19.6      3.18      6.16 0.00000000156 0.0000000125

We see the primary contributors to the AP levels to be bone metastasis, a relevant conclusion as this is prevalent in advanced diagnosis.

Exploratory analysis of effects on bone metastasis

We wanted to investigate the correlation between bone metastatis and blood factors.

bm_ap_hb_model <- lm(
  bone_metastases ~
    serum_prostatic_acid_phosphatase*I(dosage**2) +
    serum_prostatic_acid_phosphatase*serum_hemoglobin +
    dosage*serum_hemoglobin,
  data = df_num)

print(bm_ap_hb_model |> tidy())
# A tibble: 8 × 5
  term                                      estimate std.error statistic p.value
  <chr>                                        <dbl>     <dbl>     <dbl>   <dbl>
1 (Intercept)                                6.78e-1 0.129         5.25  2.23e-7
2 serum_prostatic_acid_phosphatase          -1.22e-2 0.00363      -3.35  8.75e-4
3 I(dosage^2)                               -1.87e-2 0.00902      -2.08  3.83e-2
4 serum_hemoglobin                          -4.68e-2 0.00950      -4.92  1.18e-6
5 dosage                                     9.47e-2 0.0695        1.36  1.74e-1
6 serum_prostatic_acid_phosphatase:I(dosag… -2.66e-5 0.0000654    -0.407 6.84e-1
7 serum_prostatic_acid_phosphatase:serum_h…  1.81e-3 0.000336      5.38  1.14e-7
8 serum_hemoglobin:dosage                    9.84e-4 0.00391       0.252 8.01e-1
bm_ap_hb_model |> 
  tidy() |>
  mutate(padj = p.adjust(p.value, method = "fdr")) |>
  filter(padj < 0.05) |>
  save_and_show_table("bone_sig_fit")
# A tibble: 4 × 6
  term                              estimate std.error statistic p.value    padj
  <chr>                                <dbl>     <dbl>     <dbl>   <dbl>   <dbl>
1 (Intercept)                        0.678    0.129         5.25 2.23e-7 8.90e-7
2 serum_prostatic_acid_phosphatase  -0.0122   0.00363      -3.35 8.75e-4 1.75e-3
3 serum_hemoglobin                  -0.0468   0.00950      -4.92 1.18e-6 3.14e-6
4 serum_prostatic_acid_phosphatase…  0.00181  0.000336      5.38 1.14e-7 8.90e-7

Here we can confirm that bone metastasis is indeed influenced by both AP and hemoglobin levels but not estrogen treatment dosage.

AP levels by bone metastasis status

As just shown serum AP is commonly elevated in advanced prostate cancer and is often used as a surrogate marker for metastatic activity. We compare AP levels between patients with and without bone metastasis and draw a boxplot to show the result.

df_num |>
  ggplot(aes(
    x = factor(bone_metastases), 
    y = serum_prostatic_acid_phosphatase)) +
  geom_boxplot() +
  scale_y_log10() + # for log scale
  labs(
    x = "Bone metastases",
    y = "AP (log scale)",
    title = "AP levels (log scale) by bone metastases status"
  )+
  theme(plot.title = element_text(hjust = 0.5) # keep titles in the middle
  )

We can see that AP levels are expected to be substantially higher in patients with bone metastases, reinforcing AP as a marker of advanced disease. This observation is consistent with the extreme AP values identified earlier in 04_describe.

Interpretation

In summary, several variables have been confirmed to be significantly associated with tumor size, and therefore with disease severity. In addition, correlations between key variables such as metastasis and AP levels have been confirmed. Along with this confirmation correlations for use in subsequent survival analysis have been confirmed. Blood-derived factors showed the expected association with the presence of bone metastases, but they did not show significant correlation with each other, nor with treatment.